class: inverse, middle, center
Download the data CSV file (link) and note where it is saved on your computer
Open the data file penguins.csv and look at it
Steps for a new data analysis project or homework:
penguins.csv) into this folder.
Use projects to keep everything together (read this)
- A project keeps track of your coding environment and file structure.
- Create an RStudio project for each data analysis project, for each homework assignment, etc.
- A project is associated with a directory folder
    + Keep data files there
    + Keep code scripts there; edit them, run them in bits or as a whole
    + Save your outputs (plots and cleaned data) there
- Only use relative paths, never absolute paths
    + relative (good): read.csv("data/mydata.csv")
    + absolute (bad): read.csv("/home/yourname/Documents/stuff/mydata.csv")
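The relative-path advice can be tried out with a minimal sketch like the one below. It assumes a throwaway project folder created under `tempdir()` and a tiny made-up `mydata.csv`; `setwd()` only imitates what opening an RStudio project does for you, so you would not normally call it yourself.

```r
# Hypothetical project folder, created in a temporary location for this demo
project_dir <- file.path(tempdir(), "workshop_practice")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)

# Put a tiny made-up data file in the project's data/ subfolder
write.csv(data.frame(x = 1:3),
          file.path(project_dir, "data", "mydata.csv"),
          row.names = FALSE)

# Opening an RStudio project sets the working directory to the project folder;
# setwd() imitates that here and returns the previous working directory
old_wd <- setwd(project_dir)

# The relative path is resolved against the project folder
mydata <- read.csv("data/mydata.csv")
nrow(mydata)

# Restore the previous working directory
setwd(old_wd)
```

Because the path is relative, the same line of code keeps working if the whole folder is moved or shared.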
Advantages of using projects
- standardizes file paths
- keeps everything together
- a whole folder can be easily shared and run on another computer
- when you open the project everything is as you left it
Let’s go through it together. (Read this for more)
.pull-left-60[ - Click]
Bonus lessons
and your workspace folder location will be shown at the top (e.g., Home/Desktop/workshop_practice)
Open penguins.csv in RStudio and look at it
Click penguins.csv in the Files pane, then click View File
We will show you how to store and use this data in R as a data frame
Currently it is still just a file in your folder.
Add this code to the setup chunk in the Rmd and run that chunk:
.pull-left[
library(tidyverse)
## ── Attaching packages ────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.1
## ✓ tidyr 1.1.1 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ───────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
] .pull-right[
]
Now we can use functions in these packages, such as read_csv() and %>% and mutate() and tabyl()
penguins <- read_csv("penguins.csv")
## Parsed with column specification:
## cols(
## id = col_double(),
## species = col_character(),
## island = col_character(),
## bill_length_mm = col_double(),
## bill_depth_mm = col_double(),
## flipper_length_mm = col_double(),
## body_mass_g = col_double(),
## sex = col_character(),
## year = col_double()
## )
# Run in console:
View(penguins)
# Can also view the data by clicking on its name in the Environment tab
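To see what reading a CSV file does without needing the real file, here is a minimal sketch. It assumes a three-row excerpt of the penguins data (values copied from the rows printed in these slides) written to a temporary file, and it uses base R's `read.csv()` as a stand-in for `read_csv()` so that no packages are required; `read_csv()` would return a tibble and may guess slightly different column types.

```r
# Write a small excerpt of the data to a temporary CSV file
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,species,island,bill_length_mm",
  "1689,Adelie,Torgersen,39.1",
  "4274,Adelie,Torgersen,NA",
  "4539,Adelie,Torgersen,40.3"
), csv_path)

# Read the file into a data frame; text columns are kept as character
penguins_small <- read.csv(csv_path, stringsAsFactors = FALSE)

# Check the column types that were guessed from the file contents
sapply(penguins_small, class)

# The literal text NA in the file is read as a missing value
is.na(penguins_small$bill_length_mm)
```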
Try knitting it!
.pull-left-60[ Vectors vs. data frames: a data frame is a collection (or array or table) of vectors
penguins
## # A tibble: 342 x 9
## id species island bill_length_mm bill_depth_mm flipper_length_…
## <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1689 Adelie Torge… 39.1 18.7 181
## 2 4274 Adelie Torge… NA 17.4 186
## 3 4539 Adelie Torge… 40.3 18 195
## 4 2435 Adelie Torge… 36.7 19.3 193
## 5 2326 Adelie Torge… 39.3 20.6 190
## 6 2637 Adelie Torge… 38.9 17.8 181
## 7 4443 Adelie Torge… NA 19.6 195
## 8 2102 Adelie Torge… 34.1 18.1 193
## 9 2975 Adelie Torge… 42 20.2 190
## 10 3966 Adelie Torge… 37.8 17.1 186
## # … with 332 more rows, and 3 more variables: body_mass_g <dbl>, sex <chr>,
## # year <dbl>
] .pull-right-40[
Allows different columns to be of different data types (e.g., numeric vs. text)
Within a single column all values share one type: if a column mixes numbers and text, everything in it is stored as text.
Vectors and data frames are examples of objects in R.
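The idea that a data frame is a collection of vectors can be sketched as follows. The three short vectors are an assumption for illustration (values taken from the first rows printed above), and base R's `data.frame()` is used instead of a tibble so the code runs without packages.

```r
# Three vectors of equal length, each holding a single data type
species <- c("Adelie", "Adelie", "Adelie")  # character
bill_length_mm <- c(39.1, NA, 40.3)         # double, with one missing value
year <- c(2007, 2007, 2007)                 # double

# A data frame binds the vectors together as columns
penguins_small <- data.frame(species, bill_length_mm, year,
                             stringsAsFactors = FALSE)

# Each column is still a vector and keeps its own type
penguins_small$bill_length_mm
sapply(penguins_small, class)

# Mixing numbers and text in one vector converts everything to text
mixed <- c(39.1, "unknown")
class(mixed)
```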
| type | description |
|---|---|
| double/numeric | numbers that are decimals |
| character | text, “strings” |
| integer | integer-valued numbers |
| factor | categorical variables stored with levels (groups) |
| logical | boolean (TRUE, FALSE) |
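One way to check which of these types a value has is `class()`. The sketch below builds one example value per row of the table; the name `type_examples` is made up for this illustration.

```r
# One example value for each type in the table
type_examples <- list(
  double    = 39.1,
  character = "Adelie",
  integer   = 181L,                       # the L suffix marks an integer
  factor    = factor(c("male", "female")),
  logical   = TRUE
)

# class() reports how R treats each value
type_classes <- sapply(type_examples, function(x) class(x)[1])
type_classes

# A factor stores its categories as levels (sorted alphabetically by default)
levels(type_examples$factor)
```

Note that R reports a double as `"numeric"`, which is why the table lists the two names together.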
read_csv() to read in your data sets
int = integer as a column type, you can treat it as a double for most intents and purposes.
NA?
glimpse(penguins) # structure of data
## Rows: 342
## Columns: 9
## $ id <dbl> 1689, 4274, 4539, 2435, 2326, 2637, 4443, 2102, 297…
## $ species <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "…
## $ island <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen",…
## $ bill_length_mm <dbl> 39.1, NA, 40.3, 36.7, 39.3, 38.9, NA, 34.1, 42.0, 3…
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 18.1, 20.…
## $ flipper_length_mm <dbl> 181, 186, 195, 193, 190, 181, 195, 193, 190, 186, 1…
## $ body_mass_g <dbl> 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3475, 425…
## $ sex <chr> "male", "female", "female", "female", "male", "fema…
## $ year <dbl> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200…
summary(penguins)
## id species island bill_length_mm
## Min. :1001 Length:342 Length:342 Min. :32.10
## 1st Qu.:2031 Class :character Class :character 1st Qu.:39.45
## Median :2984 Mode :character Mode :character Median :44.70
## Mean :3031 Mean :44.00
## 3rd Qu.:4073 3rd Qu.:48.52
## Max. :4969 Max. :59.60
## NA's :6
## bill_depth_mm flipper_length_mm body_mass_g sex
## Min. :13.10 Min. :172.0 Min. :2700 Length:342
## 1st Qu.:15.60 1st Qu.:190.0 1st Qu.:3550 Class :character
## Median :17.30 Median :197.0 Median :4050 Mode :character
## Mean :17.15 Mean :200.9 Mean :4202
## 3rd Qu.:18.70 3rd Qu.:213.0 3rd Qu.:4750
## Max. :21.50 Max. :231.0 Max. :6300
##
## year
## Min. :2007
## 1st Qu.:2007
## Median :2008
## Mean :2008
## 3rd Qu.:2009
## Max. :2009
##
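The `NA's` entry in the summary output counts missing values. As a small sketch of how missing values behave, the vector below holds the first ten bill lengths printed in these slides (two of them are `NA`); the variable names are chosen for this illustration only.

```r
# First ten bill lengths as printed above; two are missing
bill_length_mm <- c(39.1, NA, 40.3, 36.7, 39.3, 38.9, NA, 34.1, 42.0, 37.8)

# Count the missing values
n_missing <- sum(is.na(bill_length_mm))
n_missing

# Missing values propagate: the plain mean is NA
mean(bill_length_mm)

# na.rm = TRUE drops the missing values before averaging
mean_observed <- mean(bill_length_mm, na.rm = TRUE)
mean_observed
```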
Printing a tibble shows only the first ten rows (and only the columns that fit on screen), so you can’t actually see all of the data.
penguins
## # A tibble: 342 x 9
## id species island bill_length_mm bill_depth_mm flipper_length_…
## <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1689 Adelie Torge… 39.1 18.7 181
## 2 4274 Adelie Torge… NA 17.4 186
## 3 4539 Adelie Torge… 40.3 18 195
## 4 2435 Adelie Torge… 36.7 19.3 193
## 5 2326 Adelie Torge… 39.3 20.6 190
## 6 2637 Adelie Torge… 38.9 17.8 181
## 7 4443 Adelie Torge… NA 19.6 195
## 8 2102 Adelie Torge… 34.1 18.1 193
## 9 2975 Adelie Torge… 42 20.2 190
## 10 3966 Adelie Torge… 37.8 17.1 186
## # … with 332 more rows, and 3 more variables: body_mass_g <dbl>, sex <chr>,
## # year <dbl>
We showed View() already; it is very handy for seeing all of the data. Run it in the console, since it is interactive.
View(penguins)
.pull-left-40[
dim(penguins)
## [1] 342 9
nrow(penguins)
## [1] 342
ncol(penguins)
## [1] 9
]
.pull-right-60[
names(penguins)
## [1] "id" "species" "island"
## [4] "bill_length_mm" "bill_depth_mm" "flipper_length_mm"
## [7] "body_mass_g" "sex" "year"
]
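How these functions relate to each other can be seen on a small example. The sketch below assumes a three-row, four-column base R data frame built from the first rows printed above; it is a stand-in for the full `penguins` data.

```r
# Small stand-in data frame: 3 rows, 4 columns
penguins_small <- data.frame(
  id = c(1689, 4274, 4539),
  species = "Adelie",        # a single value is repeated for every row
  island = "Torgersen",
  bill_length_mm = c(39.1, NA, 40.3),
  stringsAsFactors = FALSE
)

# dim() returns the number of rows first, then the number of columns
dim(penguins_small)

# nrow() and ncol() return the same two numbers one at a time
nrow(penguins_small)
ncol(penguins_small)

# names() returns one name per column
names(penguins_small)
```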
head(penguins)
## # A tibble: 6 x 9
## id species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1689 Adelie Torge… 39.1 18.7 181 3750
## 2 4274 Adelie Torge… NA 17.4 186 3800
## 3 4539 Adelie Torge… 40.3 18 195 3250
## 4 2435 Adelie Torge… 36.7 19.3 193 3450
## 5 2326 Adelie Torge… 39.3 20.6 190 3650
## 6 2637 Adelie Torge… 38.9 17.8 181 3625
## # … with 2 more variables: sex <chr>, year <dbl>
tail(penguins)
## # A tibble: 6 x 9
## id species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1947 Chinst… Dream 45.7 17 195 3650
## 2 4452 Chinst… Dream 55.8 19.8 207 4000
## 3 2420 Chinst… Dream 43.5 18.1 202 3400
## 4 4861 Chinst… Dream 49.6 18.2 193 3775
## 5 4865 Chinst… Dream 50.8 19 210 4100
## 6 4162 Chinst… Dream 50.2 18.7 198 3775
## # … with 2 more variables: sex <chr>, year <dbl>
head(penguins, 3)
## # A tibble: 3 x 9
## id species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1689 Adelie Torge… 39.1 18.7 181 3750
## 2 4274 Adelie Torge… NA 17.4 186 3800
## 3 4539 Adelie Torge… 40.3 18 195 3250
## # … with 2 more variables: sex <chr>, year <dbl>
tail(penguins, 1)
## # A tibble: 1 x 9
## id species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 4162 Chinst… Dream 50.2 18.7 198 3775
## # … with 2 more variables: sex <chr>, year <dbl>
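The second argument of `head()` and `tail()` controls how many rows are returned. The sketch below assumes a ten-row base R data frame holding the `id` and `flipper_length_mm` values of the first ten rows printed earlier; the negative-`n` call is an extra not shown in the slides.

```r
# Ten-row stand-in for the penguins data
penguins_small <- data.frame(
  id = c(1689, 4274, 4539, 2435, 2326, 2637, 4443, 2102, 2975, 3966),
  flipper_length_mm = c(181, 186, 195, 193, 190, 181, 195, 193, 190, 186)
)

# Default: first or last six rows
first_six <- head(penguins_small)
last_six <- tail(penguins_small)

# A positive n picks that many rows
first_three <- head(penguins_small, 3)
last_one <- tail(penguins_small, 1)

# A negative n drops rows instead: everything except the last two rows
all_but_last_two <- head(penguins_small, -2)
nrow(all_but_last_two)
```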
.pull-left-60[
Specific cell: DataSetName[row#, column#]
# Second row, Third column
penguins[2, 3]
## # A tibble: 1 x 1
## island
## <chr>
## 1 Torgersen
Entire row: DataSetName[row#, ]
# Second row
penguins[2,]
## # A tibble: 1 x 9
## id species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 4274 Adelie Torge… NA 17.4 186 3800
## # … with 2 more variables: sex <chr>, year <dbl>
]
.pull-right-40[ Entire column: DataSetName[, column#]
# Third column
penguins[, 3]
## # A tibble: 342 x 1
## island
## <chr>
## 1 Torgersen
## 2 Torgersen
## 3 Torgersen
## 4 Torgersen
## 5 Torgersen
## 6 Torgersen
## 7 Torgersen
## 8 Torgersen
## 9 Torgersen
## 10 Torgersen
## # … with 332 more rows
]
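Rows and columns can also be selected by column name, which is often easier to read than a column number. The sketch below assumes a three-row base R data frame built from the first rows printed above. One difference to be aware of: a base R data frame returns a plain vector when a single column is selected, whereas a tibble (as in the output above) returns a one-column tibble.

```r
# Small stand-in data frame
penguins_small <- data.frame(
  id = c(1689, 4274, 4539),
  species = c("Adelie", "Adelie", "Adelie"),
  island = c("Torgersen", "Torgersen", "Torgersen"),
  stringsAsFactors = FALSE
)

# Second row, third column: by position and by name
cell_by_position <- penguins_small[2, 3]
cell_by_name <- penguins_small[2, "island"]

# Entire second row (a one-row data frame)
second_row <- penguins_small[2, ]

# Entire island column as a vector
island_vector <- penguins_small$island

# drop = FALSE keeps the result as a one-column data frame
island_column <- penguins_small[, "island", drop = FALSE]
dim(island_column)
```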